A Study on Distance-based Outlier Detection on Uncertain Data
Author
Abstract
Uncertain data management, querying and mining have become important because most real-world data today is accompanied by uncertainty. Uncertainty in data is often caused by deficiencies in the underlying data-collection equipment, or is sometimes introduced manually to preserve data privacy. The uncertainty information in the data is useful and can be used to improve the quality of the underlying results. Therefore, this dissertation addresses three problems related to outlier detection on uncertain data. 1) Distance-based outlier detection on uncertain data: In this research, we give a novel definition of distance-based outliers on uncertain data. Since the distance probability computation is expensive, a cell-based approach is proposed to index the dataset objects and to speed up the outlier detection process. The cell-based approach identifies and prunes the cells containing only inliers based on each cell's bounds on the outlier score (#D-neighbors). Similarly, it can also detect the cells containing only outliers. 2) Top-k outlier detection on uncertain data: In this work, a top-k distance-based outlier detection approach is presented. In order to detect top-k outliers from uncertain data efficiently, we propose a data structure, the populated-cells list (PC-list). Using the PC-list, the top-k outlier detection algorithm needs to consider only a fraction of the dataset objects and hence quickly identifies candidate objects for the top-k outliers. 3) Continuous outlier detection on uncertain data streams: In this part of the dissertation, a distance-based approach is proposed to detect outliers continuously from a set of uncertain objects' states that originate synchronously from a group of data sources (e.g., sensors in a WSN). A set of objects' states at a timestamp is called a state set. Usually, the duration between two consecutive timestamps is very short and the states of the objects may not change much in this duration. Therefore, to eliminate unnecessary computation at every timestamp, an incremental approach to outlier detection is proposed, which uses the outlier detection results obtained at the previous timestamp to detect outliers at the current timestamp. Finally, extensive experimental evaluations on real and synthetic datasets are presented for each of the proposed outlier detection approaches to demonstrate their accuracy, efficiency and scalability.
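To make the idea of an outlier score based on D-neighbors concrete, the following is a minimal sketch of a naive (unindexed) baseline, not the dissertation's cell-based algorithm. It assumes each uncertain object is an isotropic Gaussian given as a (mean, standard deviation) pair, estimates the probability that two objects lie within distance D by Monte Carlo sampling, and flags an object when its expected number of D-neighbors falls below a threshold. The function names, the `min_neighbors` threshold and the sampling scheme are all illustrative assumptions.

```python
import math
import random


def neighbor_probability(mu_a, sigma_a, mu_b, sigma_b, D, rng, samples=2000):
    """Monte Carlo estimate of Pr(||A - B|| <= D) for two independent
    isotropic Gaussian objects (assumed model, not taken from the source)."""
    # A - B is Gaussian with mean mu_a - mu_b and per-axis standard
    # deviation sqrt(sigma_a^2 + sigma_b^2), so we sample it directly.
    sigma = math.sqrt(sigma_a ** 2 + sigma_b ** 2)
    diff = [a - b for a, b in zip(mu_a, mu_b)]
    hits = 0
    for _ in range(samples):
        sq_dist = sum((d + rng.gauss(0.0, sigma)) ** 2 for d in diff)
        if sq_dist <= D * D:
            hits += 1
    return hits / samples


def expected_neighbors(objects, D, seed=0, samples=2000):
    """Expected number of D-neighbors of every object (naive O(n^2) loop)."""
    rng = random.Random(seed)  # fixed seed keeps the estimate reproducible
    scores = []
    for i, (mu_i, s_i) in enumerate(objects):
        total = 0.0
        for j, (mu_j, s_j) in enumerate(objects):
            if i != j:
                total += neighbor_probability(mu_i, s_i, mu_j, s_j, D, rng, samples)
        scores.append(total)
    return scores


def detect_outliers(objects, D, min_neighbors, seed=0, samples=2000):
    """Indices of objects whose expected D-neighbor count is below the
    hypothetical threshold min_neighbors."""
    scores = expected_neighbors(objects, D, seed, samples)
    return [i for i, score in enumerate(scores) if score < min_neighbors]


if __name__ == "__main__":
    # Five tightly clustered objects and one far away from the cluster.
    objects = [
        ((0.0, 0.0), 0.1), ((0.1, 0.0), 0.1), ((0.0, 0.1), 0.1),
        ((0.1, 0.1), 0.1), ((0.05, 0.05), 0.1), ((10.0, 10.0), 0.1),
    ]
    print(detect_outliers(objects, D=1.0, min_neighbors=2.0))
```

The quadratic loop over all object pairs is exactly the cost that an index such as the cell-based approach described above is meant to avoid; a grid would let whole cells be accepted or rejected from bounds on this score without evaluating every pair.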
Similar resources
Top-k Distance-Based Outlier Detection on Uncertain Data
This paper studies the problem of top-k distance-based outlier detection on uncertain data. In this work, an uncertain object is modelled by a Gaussian probability density function. Since the Naive approach is very expensive due to costly distance function between uncertain objects, a populated-cell list (PC-list) based top-k distance-based outlier detection approach is proposed in this work. W...
Distance-Based Outlier Detection on Uncertain Data of Gaussian Distribution
Managing and mining uncertain data is becoming important with the increase in the use of devices responsible for generating uncertain data, for example sensors, RFIDs, etc. In this paper, we extend the notion of distance-based outliers for uncertain data. To the best of our knowledge, this is the first work on distance-based outlier detection on uncertain data of Gaussian distribution. Since th...
Fast Top-k Distance-Based Outlier Detection on Uncertain Data
This paper studies the problem of top-k distance-based outlier detection on uncertain data. In this work, an uncertain object is modelled by a probability density function of a Gaussian distribution. We start with the Naive approach. We then introduce a populated-cell list (PC-list), a sorted list of the non-empty cells of a grid (the grid is used to index our data). Using PC-list, our top-k outlier de...
Outlier Detection for Support Vector Machine using Minimum Covariance Determinant Estimator
The purpose of this paper is to identify the points that affect the performance of one of the important algorithms of data mining, namely the support vector machine. The final classification decision is made based on a small portion of the data called support vectors. So, the existence of atypical observations among these points will result in deviation from the correct decision. Thus...
An Efficient Representation Model of Distance Distribution Between Two Uncertain Objects
In this paper, we consider the problem of efficient computation of the distance distribution between two uncertain objects. It is important to many uncertain query evaluation tasks (e.g., range queries, nearest-neighbour queries) and uncertain data mining tasks (e.g., classification, clustering and outlier detection). However, existing approaches involve distance computations between samples of two objects, wh...